1. INTRODUCTION

Many animals are routinely identified by their vocalizations, especially when images of them are not
available. This is usually done manually by experts working from field recordings. Such identification has
wide-ranging applications in natural habitat conservation. For example, the frog population is used as
one of the bio-indicators in assessing the health of habitats such as wetlands and floodplains [1]. This has
sparked interest in automating the identification and classification of animals from their vocalizations. Work
in this area of bio-acoustic signal analysis has mostly concentrated on techniques similar to those used for
processing speech signals [2], [3]. Bio-acoustic sounds are routinely segmented in order
to isolate syllables [4], [5]. The next step usually involves extracting relevant features, such as the popular
Mel Frequency Cepstral Coefficients (MFCC) [6], [7], or simpler features based on parameters such
as sound or call duration, maximum power, and maximum and minimum frequencies [8]. As in speech
processing, once the features are extracted, they are used as inputs to classifiers such as Support Vector
Machines [9], Nearest Neighbors [10], Neural Networks [11], [12] and many others [13-15].

Another approach to processing animal calls is based on spectrograms. Spectrograms are visual
representations of audio signals obtained using the Short-Time Fourier Transform (STFT). Image processing
techniques commonly used in many applications [16] have been applied to spectrograms to automatically
analyze animal calls [17]. Researchers have extracted features from the spectrograms of animal calls, such
as local peaks [18] and ridges [19]. These features, or their derivatives, are then used as inputs to classifiers
similar to those described above for speech processing.

This paper presents a different approach to animal call classification. It investigates the possibility
of representing animal call spectrograms as call-prints, similar to fingerprints in humans. As call-prints are
images, a classification technique based on Correlation Filters (CFs) is applied to them. Many
types of CF have been used for image classification. They are designed to exhibit attractive
properties such as noise robustness, shift-invariance, distortion tolerance and gradual degradation. As such,
they have been applied to image processing applications such as biometric classification, pedestrian
detection, object detection and tracking [20], [21]. In these applications, a template (also known as a filter) is
carefully designed from the training images. A query image is then cross-correlated with the template to
produce the output; the operation is performed in the frequency domain in order to take advantage of
the efficiency of the Fast Fourier Transform (FFT) algorithm.

The Maximum Margin Correlation Filter (MMCF) is considered in this paper because it has been
shown to perform better than other well-known types of CF such as the Optimal Trade-off Synthetic
Discriminant Function (OTSDF), the Unconstrained Optimal Trade-off Synthetic Discriminant Function
(UOTSDF), the Average of Synthetic Exact Filters (ASEF) and the Minimum Output Sum of Squared Error
(MOSSE) filter. The MMCF is designed to provide not only good classification
but also good localization, as demonstrated in tasks such as vehicle recognition, eye localization and
face classification [22], [23].

In this paper, the MMCF is applied to bio-acoustic signal spectrograms, which serve as the images from
which the templates are constructed. Several training images are used to synthesize a filter template. In order
to obtain distinctive animal call-prints, the spectrograms have to be carefully constructed by centering the call
in an image frame and by selecting parameters that highlight the salient features of the calls. An even better
representation of the call-print is obtained using time-frequency reassignment. To demonstrate the viability of
this technique, it is applied to a two-class task: classifying two different frog species based on their calls.


2. RESEARCH METHOD 

The proposed technique consists of three main parts: constructing call-prints using spectrograms, 
constructing MMCF templates using multiple call-prints and classifying animal vocalization using 
correlation plane parameters. 


2.1. Constructing call-prints 
2.1.1. Spectrogram 

It is important to obtain a good spectrogram representation of an animal call. This call-print
depends on the parameters used for the spectrogram, which in turn depend on the dataset. In this case,
the recordings of the two species of frogs, sampled at 44.1 kHz, are
segmented into individual calls of 800 ms length using a sound editing tool. Each segment is then filtered
with a high-pass filter with a cut-off frequency of 250 Hz in order to eliminate environmental noise. The
peak amplitude is then centered at 400 ms before the STFT is applied to obtain the
spectrogram.
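The pre-processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which high-pass design was used, so a simple FFT brick-wall filter is substituted, and samples shifted outside the 800 ms frame are replaced by zeros. The names `highpass_fft` and `center_peak` are hypothetical, and a synthetic burst stands in for a real recording.

```python
import numpy as np

FS = 44_100        # sampling rate stated in the text (Hz)
SEG_MS = 800       # segment length stated in the text (ms)
PEAK_MS = 400      # peak position stated in the text (ms)
CUTOFF_HZ = 250.0  # high-pass cut-off stated in the text (Hz)


def highpass_fft(x, fs=FS, cutoff_hz=CUTOFF_HZ):
    """Zero every frequency bin below cutoff_hz (assumed brick-wall design)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def center_peak(x, fs=FS, seg_ms=SEG_MS, peak_ms=PEAK_MS):
    """Return a seg_ms-long frame with the largest |sample| placed at peak_ms."""
    seg_len = int(round(fs * seg_ms / 1000))
    target = int(round(fs * peak_ms / 1000))
    shift = target - int(np.argmax(np.abs(x)))
    out = np.zeros(seg_len)
    # Copy the samples that still fall inside the frame after shifting;
    # everything else is zero-padded (assumed, the paper does not say).
    src_lo = max(0, -shift)
    src_hi = min(len(x), seg_len - shift)
    out[src_lo + shift:src_hi + shift] = x[src_lo:src_hi]
    return out


# Synthetic stand-in for a call: a 3 kHz burst at 250 ms plus 50 Hz hum.
t = np.arange(int(FS * SEG_MS / 1000)) / FS
burst = np.exp(-((t - 0.25) / 0.01) ** 2) * np.sin(2 * np.pi * 3000 * t)
hum = 0.5 * np.sin(2 * np.pi * 50 * t)
call = center_peak(highpass_fft(burst + hum))
```

Centering every call on its peak means that call-prints of the same species line up in the image frame before any template is built from them.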

Framing is performed on the calls with a chosen frame length of 256 samples and a 75 percent overlap.
These parameters were obtained after several trials with the objective of obtaining visually clear call-prints.
Windowing is applied using the Gaussian window, chosen for its superior performance in eliminating
energy leakage, even though it is more computationally intensive than other windows [23]. The
Gaussian window is described by


at 25n)2 
h(n) =e 2% , 0<Inl < (1) 


where N is the window length.
Each windowed frame is then transformed from the time domain into the frequency domain by the
STFT to construct the spectrogram of the magnitude spectrum, referred to here as the call-print.
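A minimal sketch of the framing, windowing and transform steps is given below. The frame length (256) and overlap (75 percent) come from the text; the Gaussian width parameter `alpha = 2.5` and the function names are assumptions made for illustration, and the image scaling used to produce the 512x512 call-prints is not covered.

```python
import numpy as np


def gaussian_window(n_win, alpha=2.5):
    """Gaussian window exp(-0.5 * (alpha * n / (N/2))**2), centred on the frame."""
    n = np.arange(n_win) - (n_win - 1) / 2.0
    return np.exp(-0.5 * (alpha * n / (n_win / 2.0)) ** 2)


def magnitude_spectrogram(x, n_win=256, overlap=0.75):
    """Magnitude STFT with shape (frequency bins, frames)."""
    hop = int(n_win * (1.0 - overlap))      # 64 samples for 256 and 75 percent
    window = gaussian_window(n_win)
    n_frames = 1 + (len(x) - n_win) // hop  # only complete frames are used
    frames = np.stack([x[i * hop:i * hop + n_win] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1)).T


# A 3 kHz tone sampled at 44.1 kHz should peak near bin 3000 / (44100 / 256).
tone = np.sin(2 * np.pi * 3000 * np.arange(4096) / 44_100)
spec = magnitude_spectrogram(tone)
```

With these settings each frequency bin is about 172 Hz wide and successive frames are about 1.45 ms apart, which illustrates the time-frequency trade-off discussed in the next subsection.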


2.1.2. Time-frequency reassignment 

Time-frequency (TF) reassignment is a technique used to overcome the shortcomings of
spectrograms caused by the trade-off between resolution in frequency and in time [24]. TF
reassignment is therefore applied to obtain an even better representation of the call-prints. It uses
information from the phase spectrum to sharpen the amplitude spectrum, and is able to locate impulses, linear
chirps, and simple sinusoids at their actual time or frequency with a higher resolution than the underlying STFT
spectrogram. It has been shown that, for Gaussian windows, TF reassignment is equivalent to moving energy
up the local gradient of intensity of the spectrogram [25].

Utilizing the modified moving window method [24], consider a set of coefficients $\varepsilon(t,\omega)$ obtained by
decomposing a time-domain signal $x(t)$ based on a set of elementary signals $h_\omega(t)$, where

$$h_\omega(t) = h(t)\,e^{j\omega t} \tag{2}$$

with $h(t)$ being a lowpass kernel function. The coefficients of this decomposition are defined as

$$\varepsilon(t,\omega) = \int x(\tau)\,h(t-\tau)\,e^{-j\omega[\tau-t]}\,d\tau \tag{3}$$

resulting in

$$\varepsilon(t,\omega) = e^{j\omega t}X(t,\omega) = X_t(\omega) = M_t(\omega)\,e^{j\phi_t(\omega)} \tag{4}$$

Here, $M_t(\omega)$ is the magnitude and $\phi_t(\omega)$ is the phase of $X_t(\omega)$, the Fourier transform of the signal $x(t)$
shifted by time $t$ and windowed by $h(t)$.

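For signals whose magnitude varies slowly in time compared to the phase, the maximum contribution
to the reconstruction integral comes from the vicinity of the point $(t,\omega)$ satisfying the phase stationarity
condition

$$\frac{\partial}{\partial\omega}\left[\phi_\tau(\omega) - \omega\tau + \omega t\right] = 0 \tag{5a}$$
$$\frac{\partial}{\partial\tau}\left[\phi_\tau(\omega) - \omega\tau + \omega t\right] = 0 \tag{5b}$$

or equivalently around the point $(\hat{t},\hat{\omega})$ defined by

$$\hat{t}(\tau,\omega) = \tau - \frac{\partial\phi_\tau(\omega)}{\partial\omega} = -\frac{\partial\phi(\tau,\omega)}{\partial\omega} \tag{6a}$$
$$\hat{\omega}(\tau,\omega) = \frac{\partial\phi_\tau(\omega)}{\partial\tau} = \omega + \frac{\partial\phi(\tau,\omega)}{\partial\tau} \tag{6b}$$

where $\phi(\tau,\omega) = \phi_\tau(\omega) - \omega\tau$ is the phase of $X(\tau,\omega)$.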


This method of reassignment changes the point of attribution of $\varepsilon(t,\omega)$ to this point of maximum
contribution $(\hat{t}(t,\omega), \hat{\omega}(t,\omega))$ rather than to the point $(t,\omega)$ at which it was originally computed. The
computed time-frequency coordinates $\hat{t}(t,\omega)$ and $\hat{\omega}(t,\omega)$ are equal to the local group delay and the local
instantaneous frequency, respectively. These quantities are normally ignored when constructing the
spectrogram, which only considers the magnitude.
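In practice the phase derivatives are rarely computed directly. A widely used alternative, due to Auger and Flandrin, obtains the same reassigned coordinates from two additional STFTs, one taken with a time-weighted window and one with the derivative of the window. The sketch below uses that variant, which is not necessarily what the authors implemented. The function name, the window width `alpha = 2.5` and the hop size are assumptions, and the final step of accumulating energy at the reassigned coordinates to form an image is omitted.

```python
import numpy as np


def reassigned_coordinates(x, n_win=256, hop=64, alpha=2.5):
    """Return |STFT|, reassigned times (samples) and frequencies (rad/sample).

    All outputs have shape (frames, frequency bins).
    """
    n = np.arange(n_win) - (n_win - 1) / 2.0     # time axis centred on the frame
    h = np.exp(-0.5 * (alpha * n / (n_win / 2.0)) ** 2)
    th = n * h                                    # time-weighted window
    dh = np.gradient(h)                           # derivative window (per sample)

    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[i * hop:i * hop + n_win] for i in range(n_frames)])
    X = np.fft.rfft(frames * h, axis=1)
    X_th = np.fft.rfft(frames * th, axis=1)
    X_dh = np.fft.rfft(frames * dh, axis=1)

    power = np.abs(X) ** 2
    # Where there is essentially no energy, divide by infinity so that the
    # correction is zero and the original coordinates are kept.
    safe = np.where(power > 1e-12 * power.max(), power, np.inf)

    centers = np.arange(n_frames) * hop + (n_win - 1) / 2.0
    omega = 2.0 * np.pi * np.arange(X.shape[1]) / n_win
    t_hat = centers[:, None] + np.real(X_th * np.conj(X)) / safe
    w_hat = omega[None, :] - np.imag(X_dh * np.conj(X)) / safe
    return np.abs(X), t_hat, w_hat


# A pure tone: the reassigned frequency at the peak bin should move towards
# the true frequency w0, which lies between two FFT bins.
w0 = 2.0 * np.pi * 3000 / 44_100
mag, t_hat, w_hat = reassigned_coordinates(np.cos(w0 * np.arange(4096)))
k = int(np.argmax(mag[30]))
```

For a unit impulse, every frame that contains it is reassigned to the impulse position, which is the sharpening behaviour described above for impulses and sinusoids.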


2.2. Constructing templates 

The bio-acoustic signal spectrograms are used to construct the MMCF template. Each image is of size
512x512, and the template is synthesized from multiple call-prints from the training set. A separate test set
is cross-correlated with the template filter in order to determine whether each test image is from the true or
the false class. In designing the template, the MMCF optimizes a criterion that produces a desired correlation
output plane, using a trade-off matrix to balance two objectives: maximizing the margin criterion, as in a
Support Vector Machine (SVM), and minimizing the localization criterion, expressed as the mean square error.
As with other CFs, the MMCF can be expressed as a closed-form solution [20]. The MMCF template can thus
be described as


$$h_{MMCF} = T^{-1}\left(\sum_{i=1}^{L} X_i g_i\right) + T^{-1} A Y a \tag{7}$$


where $T$ is the trade-off matrix, $X_i$ is the $d \times d$ diagonal matrix form of the ith training image in the frequency
domain, with the vector $x_i$ along its diagonal, $g_i$ is the vector representation of the expected 2D correlation
output for the ith training image, $A$ is a $d \times L$ matrix whose columns are the $L$ training image vectors
$x_i$, $Y$ is a diagonal matrix with the class labels (1 for the true class, 0 for the false class) along its diagonal, and the
vector $a$ is obtained using the sequential minimal optimization technique.


2.3. Classification

The template matching process for the input test image S(x,y) and a correlation filter template
H(u,v) is given by


$$c(x,y) = \mathrm{IFFT}\left\{\mathrm{FFT}\big(S(x,y)\big) \cdot H^{*}(u,v)\right\} \tag{8}$$


The test image is first converted to the frequency domain and then reshaped into a
vector. It is then multiplied element-wise by the conjugate of the MMCF template, which is equivalent to
cross-correlating the image with the template in the spatial domain. The output is then transformed back to
the spatial domain in order to obtain the correlation plane.
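The correlation step can be illustrated with a few lines of NumPy. This is a sketch only: `template_freq` is a hypothetical name for a frequency-domain template, and a plain matched filter (the FFT of the image itself) stands in for a trained MMCF, which is not reproduced here.

```python
import numpy as np


def correlation_plane(image, template_freq):
    """Correlation plane c = IFFT{ FFT(image) * conj(template) } as a real array."""
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * np.conj(template_freq)))


# Stand-in data: a random image and a matched-filter template built from it.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
template_freq = np.fft.fft2(image)

plane = correlation_plane(image, template_freq)

# A circularly shifted copy of the image moves the peak by the same amount.
shifted = np.roll(image, shift=(3, 5), axis=(0, 1))
plane_shifted = correlation_plane(shifted, template_freq)
```

The peak of `plane` lies at the origin, while the peak of `plane_shifted` lies at row 3, column 5. This is the shift-invariance property mentioned in the introduction and the reason the call is centered in the image frame beforehand.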

If the test image belongs to the same class as the designed filter, the resulting correlation plane 
produces a sharp peak at the origin while the values everywhere else are close to zero. To measure the 
sharpness of the peak, the Peak-to-Sidelobe ratio (PSR) is used, where 


$$\mathrm{PSR} = \frac{\text{peak} - \text{mean}}{\text{standard deviation}} \tag{9}$$

The peak is the largest value in the correlation output for the test image. The standard
deviation and mean are calculated from a sidelobe region excluding a central mask [17]. To classify the
frogs into the correct class, a threshold value for each class is determined from the PSR values obtained by
applying the cross-correlation process to the dataset. The cross-correlation process is then performed on the
test set. If an image has a PSR value greater than the threshold for the tested class, it is classified as
belonging to that class (true); otherwise it is classified as false.
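A minimal sketch of the PSR computation and the threshold test follows. The sizes of the sidelobe region and of the central mask are not given in the text, so the sketch assumes that the whole plane is the sidelobe region and that a 5x5 mask centered on the peak is excluded, with wrap-around at the edges. The function names and the toy plane are hypothetical.

```python
import numpy as np


def peak_to_sidelobe_ratio(plane, mask_half_width=2):
    """PSR = (peak - sidelobe mean) / sidelobe standard deviation."""
    n_rows, n_cols = plane.shape
    peak_row, peak_col = np.unravel_index(np.argmax(plane), plane.shape)
    peak = plane[peak_row, peak_col]
    # Circular distance to the peak, so a peak at the origin wraps correctly.
    d_row = np.abs(np.arange(n_rows) - peak_row)
    d_col = np.abs(np.arange(n_cols) - peak_col)
    d_row = np.minimum(d_row, n_rows - d_row)[:, None]
    d_col = np.minimum(d_col, n_cols - d_col)[None, :]
    # Everything outside the (assumed) 5x5 mask counts as sidelobe.
    sidelobe = plane[(d_row > mask_half_width) | (d_col > mask_half_width)]
    return float((peak - sidelobe.mean()) / sidelobe.std())


def is_true_class(plane, threshold):
    """Accept the test image when its PSR exceeds the class threshold."""
    return peak_to_sidelobe_ratio(plane) > threshold


# Toy plane: +/-1 checkerboard sidelobes with a peak of 10 at the origin.
r, c = np.indices((32, 32))
toy = np.where((r + c) % 2 == 0, 1.0, -1.0)
toy[0, 0] = 10.0
psr = peak_to_sidelobe_ratio(toy)
```

In this toy plane the sidelobe mean is close to 0 and the standard deviation is close to 1, so the PSR is close to the peak height of 10.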


3. RESULTS AND ANALYSIS 

To demonstrate the viability of using the proposed technique to classify call-prints, two species of
frogs commonly found in Malaysia were considered. Recordings were obtained for common grass frogs (F.
limnocharis Boie) and mangrove frogs (F. cancrivora Gravenhorst). The calls were subsequently processed
as described in section 2.1 to obtain the call-prints. For each species, 30 call-prints were obtained and divided
equally into the training and testing sets. The templates for each class were constructed as described in
section 2.2, using 5, 10 and 15 call-prints from the training set. The test set was then cross-correlated with
the templates as described in section 2.3, and the accuracy rate was calculated as the ratio of correct
classifications to the total number of test inputs. The results are tabulated in Tables 1 to 3 for different
numbers of call-prints per template, with and without TF reassignment.


Table 1. Accuracy Rate with 5 Call-Prints per Template

Species                      Without TF Reassignment (%)   With TF Reassignment (%)
F. limnocharis Boie          25.3                          34.7
F. cancrivora Gravenhorst    15.7                          21.1
Average                      20.5                          27.9





Table 2. Accuracy Rate with 10 Call-Prints per Template

Species                      Without TF Reassignment (%)   With TF Reassignment (%)
F. limnocharis Boie          45.9                          60.7
F. cancrivora Gravenhorst    60.5                          73.1
Average                      53.2                          66.5





Table 3. Accuracy Rate with 15 Call-Prints per Template

Species                      Without TF Reassignment (%)   With TF Reassignment (%)
F. limnocharis Boie          67.5                          79.0
F. cancrivora Gravenhorst    73.1                          86.8
Average                      70.3                          82.9

The tables show that, in general, both with and without TF reassignment, the accuracy rate
increases as the number of call-prints per template increases. The results also demonstrate that using TF
reassignment increases the average accuracy rate by more than 10 percentage points for the cases of 10 and
15 call-prints per template, suggesting that this is a viable technique for distinguishing between animal
species based on their vocalization spectrograms.
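The size of the improvement can be checked directly from the Average rows of Tables 1 to 3. The short calculation below takes those tabulated averages at face value; the variable names are for illustration only.

```python
# Average accuracy rates (%) from Tables 1 to 3, keyed by call-prints per
# template: (without TF reassignment, with TF reassignment).
averages = {5: (20.5, 27.9), 10: (53.2, 66.5), 15: (70.3, 82.9)}

# Gain from TF reassignment in percentage points, rounded to one decimal.
gains = {k: round(with_tf - without_tf, 1)
         for k, (without_tf, with_tf) in averages.items()}

for k, gain in gains.items():
    print(f"{k} call-prints per template: +{gain} percentage points")
```

The gains are 7.4, 13.3 and 12.6 percentage points for 5, 10 and 15 call-prints per template, so only the 5 call-print case falls below the 10-point mark.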


4. CONCLUSION 

This paper has shown that bio-acoustic signals may be classified using MMCFs when the signals are
converted to spectrograms in order to obtain call-prints. Call-prints have been shown to be viable image
inputs to MMCFs for classifying frogs based on their vocalizations. However, careful selection of the
spectrogram parameters is required in order to produce clear and distinguishing call-prints. Multiple
call-prints are used to construct the MMCF template representing each species. The results of classifying two
frog species showed that the accuracy rate increases as the number of call-prints per template increases.
Furthermore, applying TF reassignment to the spectrograms increases the accuracy rate overall, and by more
than 10 percentage points for 10 and 15 call-prints per template.